In brief
Most time in data analysis is spent ‘tidying up’ data: getting it into a suitable format to get started. Data scientists have a particular definition of tidy: Tidy datasets are “easy to manipulate, model and visualize, and have a specific structure: each variable is a column, each observation is a row” (Wickham 2014).
It’s often not convenient for humans to enter data in a tidy way, so untidy data is probably more common than tidy data in the wild. But doing good, reproducible science demands that we document each step of our processing in a way that others can check or repeat in future. Tools like R make this easier.
Overview
In previous worksheets we used various commands in the tidyverse, like filter and group_by.
If you want to recap these commands you can use the cheatsheet, especially the part on groups and summaries.
Today we will cover three additional techniques which are important when working with real datasets:
- Creating new variables/columns
- ‘Pivoting’ or reshaping data from long to wide formats (or the reverse)
- Joining two sources of data (e.g. two spreadsheets) into a single dataframe
Before you start
- Make sure you complete the worksheet on exploring data and plotting
- Remember to load the tidyverse
library(tidyverse)Code review
TODO ADD ACTIVITY HERE
Making new variables
Sometimes we need to create new columns in our dataset. For example, let’s say we wanted to calculate someone’s BMI from their weight and height.
There is a built in dataset called women, which contains heights and weights of 15 women in lbs and inches:
women %>% glimpse
> Rows: 15
> Columns: 2
> $ height <dbl> 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72
> $ weight <dbl> 115, 117, 120, 123, 126, 129, 132, 135, 139, 142, 146, 150, 15…To calculate a BMI we first need to convert the heights to from inches to meters, and the weights from lbs to kilograms.
metric_women <- women %>%
mutate(
height_m = height*0.0254, # approx conversion from inches to m
weight_kg= weight*0.45 # conversion from lbs to kg
)Explanation: We used the mutate function to covert lbs to kg and inches to m, and saved these two new columns in a new dataset called metric_women.
We can see the new columns here:
metric_women %>% head(3)
> height weight height_m weight_kg
> 1 58 115 1.4732 51.75
> 2 59 117 1.4986 52.65
> 3 60 120 1.5240 54.00BMI is calculated as \(\dfrac{kg}{m^2}\). We can use mutate again for this:
metric_women %>%
mutate(BMI = weight_kg / height_m^2) %>%
head(3)
> height weight height_m weight_kg BMI
> 1 58 115 1.4732 51.75 23.84443
> 2 59 117 1.4986 52.65 23.44374
> 3 60 120 1.5240 54.00 23.25005Explanation: We used mutate again to make a new column, BMI. This contains womens’ weight divided by their squared height (^2 means to the power of 2 in R-speak).
- Create a density plot of BMI scores in the
womendataset. It should look like this:
- What is the median BMI in the sample?
Pivoting (reshaping)
Data is commonly stored in either wide or long format.
If you used SPSS to do a t-test or ANOVA during your undergraduate degree, you likely analysed the data in wide format.
In wide format, each row represents the observations from a single participant. Each measurement for a given participant are stored in separate columns.
This is often called row per subject data. An example is the built in attitude dataset:
attitude %>%
head()
> rating complaints privileges learning raises critical advance
> 1 43 51 30 39 61 92 45
> 2 63 64 51 54 63 73 47
> 3 71 70 68 69 76 86 48
> 4 61 63 45 47 54 84 35
> 5 81 78 56 66 71 83 47
> 6 43 55 49 44 54 49 34Explanation: Each row contains scores for a particular employee on various measures. To find out more about these data you can type ?attitude into the console.
Let’s say we want a single plot of all these variables, something like this:
To do this we first need to convert the data to long format. In long format, each observation is saved in its own row, rather than across multiple columns.
It’s often called “row per observation” data.
Pivoting is where you take a long data file (lots of rows, few columns) and make it wider. Or where you take a wide data file (lots of columns, few rows) and make it longer.
We can convert from wide to long using the pivot_longer command.
Another term sometimes used for pivoting data to long form is melting img: TrueWarrior
To see why the command is called ‘pivot_longer’, imagine trying to reshape just the first two rwos of the attitude dataset:
> rating complaints privileges learning raises critical advance
> 1 43 51 30 39 61 92 45
> 2 63 64 51 54 63 73 47
If we use pivot_longer on this selection, we end up with this:
attitude %>%
pivot_longer(everything())
> # A tibble: 210 x 2
> name value
> <chr> <dbl>
> 1 rating 43
> 2 complaints 51
> 3 privileges 30
> 4 learning 39
> 5 raises 61
> 6 critical 92
> 7 advance 45
> 8 rating 63
> 9 complaints 64
> 10 privileges 51
> # … with 200 more rowsExplanation of the command: We selected a subset of columns and rows. Then we used pivot_longer(everything()) to make this into long form data. The everything() tells R to merge values from all of the columns into a single new column called value, and to keep track of the original variable name in a new column called name)
The change works like like this:
Converting from wide format to long format
In this example we don’t have an explicit record of which participant was which in the attitude dataset, because the mapping to participants was implicit: each row was a new participant, but they were not marked.
We can make this explicit by adding a new column to the data with the mutate and row_number() commands:
attitude_with_person <- attitude %>%
mutate(person = row_number()) %>%
head(2)
attitude_with_person
> rating complaints privileges learning raises critical advance person
> 1 43 51 30 39 61 92 45 1
> 2 63 64 51 54 63 73 47 2But now, if we pivot_longer again, we will need to tell R which columns we would like to pivot. If we don’t the person column gets melted with everything else so we lose track of which response belonged to which participant:
attitude_with_person %>%
pivot_longer(everything())
> # A tibble: 16 x 2
> name value
> <chr> <dbl>
> 1 rating 43
> 2 complaints 51
> 3 privileges 30
> 4 learning 39
> 5 raises 61
> 6 critical 92
> 7 advance 45
> 8 person 1
> 9 rating 63
> 10 complaints 64
> 11 privileges 51
> 12 learning 54
> 13 raises 63
> 14 critical 73
> 15 advance 47
> 16 person 2We can exclude person from the pivoting by writing:
Explanation of the command: Here, we still use pivot_longer but this time we put -person between the parentheses. The minus sign, -, means don’t include this variable, so -person ends up meaning include all columns except person, which is what we wanted.
Use the tidyverse commands you know plus pivot_longer to produce this plot using the attitude data:
Pivoting data to make summaries
Imagine we want a table of the mean score for each question in the attitude dataset.
This would be fiddly if we just tried to use summarise on wide format data. But if we use pivot_longer, group_by and then summarise (in that order) it’s possible to take the data and make a table like this with 3 instructions to R:
> # A tibble: 7 x 3
> Name Mean SD
> <chr> <dbl> <dbl>
> 1 advance 42.9 10.3
> 2 complaints 66.6 13.3
> 3 critical 74.8 9.89
> 4 learning 56.4 11.7
> 5 privileges 53.1 12.2
> 6 raises 64.6 10.4
> 7 rating 64.6 12.2
Combine the pivot_longer, group_by and summarise commands (in that order) to reproduce the table above.
Hints:
- You want to pivot all of the variables in the attitude dataset this time
- We covered using
summarisein the first and second worksheets. If you want a reminder, the cheatsheet might be the quickest place to look
Pivoting wider
Sometimes we have the opposite problem: We have long data, but want it in wide format. For example, we might want a table where it’s easy to compare between different years, like this:
gapminder::gapminder %>%
filter(year > 1990) %>%
pivot_wider(id_cols=country, names_from=year, values_from=gdpPercap) %>%
head(3) %>%
pander::pander("GDP per-capita in 3 countries in 3 different years, from the gaminder dataset.")| country | 1992 | 1997 | 2002 | 2007 |
|---|---|---|---|---|
| Afghanistan | 649.3 | 635.3 | 726.7 | 974.6 |
| Albania | 2497 | 3193 | 4604 | 5937 |
| Algeria | 5023 | 4797 | 5288 | 6223 |
Instead of pivot_longer, now we want to pivot_wider.
As we saw before the gapminder data is a fairly long format. There are multiple rows per-country corresponding to different years.
Let’s say we want to pivot gapminder to a wide format to compare GDP in different years. We first need to select the data we want — country, year and GDP:
gapminder1990s <- gapminder::gapminder %>%
select(country, year, gdpPercap) %>%
filter(year >= 1990)Then we pivot_wider:
gapminder1990s %>%
pivot_wider(names_from = year, values_from = gdpPercap) %>%
head()
> # A tibble: 6 x 5
> country `1992` `1997` `2002` `2007`
> <fct> <dbl> <dbl> <dbl> <dbl>
> 1 Afghanistan 649. 635. 727. 975.
> 2 Albania 2497. 3193. 4604. 5937.
> 3 Algeria 5023. 4797. 5288. 6223.
> 4 Angola 2628. 2277. 2773. 4797.
> 5 Argentina 9308. 10967. 8798. 12779.
> 6 Australia 23425. 26998. 30688. 34435.Explanation of the command and output: We started with multiple rows per country, corresponding to years. We used pivot_wider with names_from = year to create new columns for each year in the data. We used values_from=gdpPercap to tell pivot_longer to use the GDP numbers to populate the table. The resulting table helps us compare years within countries, or between countries for a given year.
Experiment for yourself with pivot_longer and pivot_wider:
- Try using pivot_longer with some different datasets, perhaps
irisandmtcars - Use the pivoted data with
group_byto create summary tables
All content on this site distributed under a Creative Commons licence. CC-BY-SA 4.0.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (1): 1–23. https://doi.org/10.18637/jss.v059.i10.